
    On-the-Fly Data Synopses: Efficient Data Exploration in the Simulation Sciences

    QUASII: QUery-Aware Spatial Incremental Index.

    With large-scale simulations of increasingly detailed models and improving data acquisition technologies, massive amounts of data are created and collected quickly and easily. Traditional systems require indexes to be built before analytic queries can be executed efficiently. Such an indexing step requires substantial computing resources and introduces a considerable and growing data-to-insight gap: scientists must wait before they can perform any analysis. Moreover, scientists often use only a small fraction of the data, the parts containing interesting phenomena, and fully indexing all of it does not always pay off. In this paper we develop a novel incremental index for the exploration of spatial data. Our approach, QUASII, builds a data-oriented index as a side effect of query execution. QUASII distributes the cost of indexing across all queries while building the index structure only for the subset of the data queried. It reduces data-to-insight time and curbs the cost of incremental indexing by gradually and partially sorting the data, producing a data-oriented hierarchical structure at the same time. As our experiments show, QUASII reduces the data-to-insight time by up to a factor of 11.4, while its performance converges to that of the state-of-the-art static indexes.
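
    The core mechanism, an index built as a side effect of query execution, can be illustrated with a short sketch. The following shows query-driven incremental indexing in the spirit of QUASII, not the algorithm itself: a leaf is refined only when a query touches it, so indexing cost is spread across queries and spent only on the queried subset of the data. The quadrant split and the LEAF_CAPACITY threshold are assumptions made for this sketch.

```python
# Minimal sketch of query-driven incremental spatial indexing (illustrative,
# not QUASII itself): nodes are refined lazily, as a side effect of queries.

LEAF_CAPACITY = 64   # hypothetical refinement threshold

class Node:
    def __init__(self, points, box):
        self.points = points      # [(x, y), ...], unsorted until queried
        self.box = box            # (xmin, ymin, xmax, ymax), half-open bounds
        self.children = None      # created lazily by queries

    def query(self, q):
        """Return all points inside query box q, refining touched nodes."""
        if not intersects(self.box, q):
            return []
        if self.children is None:
            if len(self.points) <= LEAF_CAPACITY:
                return [p for p in self.points if contains(q, p)]
            self._split()         # refine only because a query reached us
        return [p for c in self.children for p in c.query(q)]

    def _split(self):
        xmin, ymin, xmax, ymax = self.box
        xm, ym = (xmin + xmax) / 2, (ymin + ymax) / 2
        boxes = [(xmin, ymin, xm, ym), (xm, ymin, xmax, ym),
                 (xmin, ym, xm, ymax), (xm, ym, xmax, ymax)]
        self.children = [Node([p for p in self.points if contains(b, p)], b)
                         for b in boxes]
        self.points = None        # the data now lives in the children

def intersects(a, b):
    return a[0] <= b[2] and b[0] <= a[2] and a[1] <= b[3] and b[1] <= a[3]

def contains(box, p):
    return box[0] <= p[0] < box[2] and box[1] <= p[1] < box[3]
```

    Repeated queries over the same region progressively refine only that part of the hierarchy, which is the effect the abstract reports as reduced data-to-insight time.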

    Towards batch-processing on cold storage devices

    Large amounts of data in storage systems are cold, i.e., Written Once and Read Occasionally (WORO). The rapid growth of massive-scale archival and historical data increases the demand for petabyte-scale, cheap storage for such cold data. A Cold Storage Device (CSD) is a disk-based storage system designed to trade performance for cost and power efficiency. Inevitably, the design restrictions used in CSDs result in performance limitations. These limitations are not a concern for WORO workloads; however, the very low price/performance characteristics of CSDs make them interesting for other applications, e.g., batch processing, too. Applications, however, can be very slow on CSDs if they do not take these characteristics into account. In this paper we design two strategies for data partitioning on CSDs, a crucial operation in many batch analytics tasks such as hash joins, near-duplicate detection, and data localization. We show that our strategies can efficiently use CSDs for batch processing of terabyte-scale data, accelerating data partitioning by 3.5x in our experiments.
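
    The constraint such strategies must respect can be made concrete with a sketch. Below is a minimal illustration of group-aware partitioning under an assumed CSD model, not the paper's two strategies: a CSD spins up only one disk group at a time and switching groups is expensive, so outgoing records are buffered per group and each group is written in a single pass. NUM_GROUPS, PARTS_PER_GROUP, and the spin_up/spin_down/write_to_partition helpers are hypothetical.

```python
# Minimal sketch of partitioning on a cold storage device (illustrative):
# buffer records per disk group, then write one group at a time so each
# group is spun up exactly once.

from collections import defaultdict

NUM_GROUPS = 8          # hypothetical number of disk groups in the CSD
PARTS_PER_GROUP = 16    # hypothetical partitions placed on each group

def spin_up(group):   print(f"spinning up disk group {group}")
def spin_down(group): print(f"spinning down disk group {group}")
def write_to_partition(part, rec): pass   # placeholder for a sequential write

def partition(records, key):
    buffers = defaultdict(list)           # a real strategy would spill to scratch space
    for rec in records:                   # one sequential read pass over the input
        part = hash(key(rec)) % (NUM_GROUPS * PARTS_PER_GROUP)
        buffers[part // PARTS_PER_GROUP].append((part, rec))
    for group in sorted(buffers):         # exactly one spin-up per disk group
        spin_up(group)
        for part, rec in sorted(buffers[group], key=lambda t: t[0]):
            write_to_partition(part, rec)
        spin_down(group)

partition(range(1_000), key=lambda r: r)  # toy run over integer "records"
```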

    AKARI/IRC Broadband Mid-infrared data as an indicator of Star Formation Rate

    The AKARI/Infrared Camera (IRC) Point Source Catalog provides a large amount of flux data in the S9W (9 μm) and L18W (18 μm) bands. With the goal of constructing star formation rate (SFR) calibrations using IRC data, we analyzed an IR-selected GALEX-SDSS-2MASS-AKARI (IRC/Far-Infrared Surveyor) sample of 153 nearby galaxies. The far-infrared fluxes were obtained from AKARI diffuse maps to correct for the underestimation of extended sources caused by point-spread-function photometry. The SFRs of these galaxies were derived with the spectral energy distribution fitting program CIGALE. Despite the complicated spectral features contained in these bands, both the S9W and L18W emission correlate with the SFR of galaxies. SFR calibrations using S9W and L18W are presented for the first time. These calibrations agree well, within the scatter, with previous works based on Spitzer data, and should be applicable to dust-rich galaxies. (PASJ, in press.)
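
    A calibration of this kind is typically a linear relation between the logarithm of the SFR and the logarithm of the monochromatic luminosity. The sketch below shows how such a calibration would be applied to an S9W flux density; the slope and intercept are placeholder values, not the coefficients fitted in the paper.

```python
# Hypothetical application of a mid-infrared SFR calibration of the form
# log SFR = a * log(nu L_nu) + b; a and b below are placeholders.

import math

def nu_L_nu(flux_jy, dist_mpc, wavelength_um=9.0):
    """Monochromatic luminosity nu*L_nu in erg/s from a flux density in Jy."""
    d_cm = dist_mpc * 3.086e24          # Mpc -> cm
    nu = 2.998e14 / wavelength_um       # c / lambda in Hz
    f_cgs = flux_jy * 1e-23             # Jy -> erg s^-1 cm^-2 Hz^-1
    return 4.0 * math.pi * d_cm ** 2 * f_cgs * nu

def sfr_from_s9w(flux_jy, dist_mpc, a=1.0, b=-43.0):
    """SFR in Msun/yr; a and b are placeholders, not the paper's fit."""
    return 10.0 ** (a * math.log10(nu_L_nu(flux_jy, dist_mpc)) + b)

print(sfr_from_s9w(flux_jy=0.5, dist_mpc=50.0))   # toy example, ~5 Msun/yr
```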

    The spectral energy distribution of galaxies at z > 2.5: Implications from the Herschel/SPIRE color-color diagram

    We use the Herschel/SPIRE color-color diagram to study the spectral energy distribution (SED) and redshift estimation of high-z galaxies. We compiled a sample of 57 galaxies at z = 2.5-6.4 with spectroscopically confirmed redshifts and SPIRE detections in all three bands, and compared their average SPIRE colors with SED templates from local and high-z libraries. We find that local SEDs are inconsistent with the high-z observations: the local calibrations of the template parameters need to be adjusted to describe the average colors of high-z galaxies. Among the high-z libraries, templates with an evolution from z = 0 to 3 describe the average colors of the observations at high redshift well. Using these templates, we defined color cuts that divide the SPIRE color-color diagram into regions with different mean redshifts. We tested this method and two other color-cut methods on a large sample of 783 Herschel-selected galaxies, and find that although these methods can separate the sample into populations with different mean redshifts, the dispersion of redshifts within each population is considerable. Additional information is needed for better sampling. (17 pages, 14 figures; accepted for publication in A&A.)
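
    The color-cut method amounts to dividing the plane of SPIRE flux-density ratios into regions with different mean redshifts. A minimal sketch follows, with placeholder thresholds rather than the cuts derived in the paper:

```python
# Illustrative redshift binning by SPIRE colors; the numeric cuts are
# placeholders, not the paper's values. Redder 250/350/500 um ratios
# broadly indicate higher redshift.

def spire_colors(s250, s350, s500):
    """Flux-density ratios spanning the axes of the SPIRE color-color plane."""
    return s350 / s250, s500 / s350

def redshift_region(s250, s350, s500):
    c1, c2 = spire_colors(s250, s350, s500)
    if c1 < 0.8 and c2 < 0.6:        # placeholder cut: blue in both colors
        return "lower mean-z region"
    if c1 > 1.1 and c2 > 0.9:        # placeholder cut: red in both colors
        return "higher mean-z region"
    return "intermediate region"

print(redshift_region(s250=30.0, s350=40.0, s500=38.0))   # fluxes in mJy
```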

    TRANSFORMERS: Robust spatial joins on non-uniform data distributions

    Spatial joins are becoming increasingly common in many applications, particularly in the scientific domain. While several approaches have been proposed for joining spatial datasets, each has its strength for a particular density ratio between the joined datasets; no single proposed method can join two spatial datasets efficiently and robustly across data distributions. Some approaches do well for datasets with contrasting densities, while others do better with similar densities; none does well when the datasets have locally divergent data distributions. In this paper we develop TRANSFORMERS, an efficient and robust spatial join approach that is robust to such variations of distribution in the joined data. TRANSFORMERS achieves this by departing from the state of the art and adapting the join strategy and data layout to local density variations in the joined data. It employs a join method based on data-oriented partitioning when joining areas of substantially different local densities, whereas it uses large partitions (as in space-oriented partitioning) when the densities are similar, seamlessly switching between the two strategies at runtime. We experimentally demonstrate that TRANSFORMERS outperforms state-of-the-art approaches by a factor of 2 to 8.
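
    The runtime switching can be sketched as a per-cell decision driven by local density estimates. The following is an illustration of the general idea, not the TRANSFORMERS implementation; the coarse grid model and the DENSITY_GAP threshold are assumptions made for the sketch.

```python
# Illustrative strategy switch: per grid cell, compare the local densities of
# the two inputs and pick a join method accordingly.

DENSITY_GAP = 4.0   # hypothetical ratio beyond which densities count as "contrasting"

def choose_strategy(count_a, count_b):
    lo, hi = sorted((max(count_a, 1), max(count_b, 1)))
    if hi / lo >= DENSITY_GAP:
        return "data-oriented"    # e.g. index the sparse side, probe with the dense side
    return "space-oriented"       # similar densities: large uniform partitions

def join_plan(grid_counts_a, grid_counts_b):
    """Map each grid cell to a local join strategy."""
    return {cell: choose_strategy(grid_counts_a.get(cell, 0),
                                  grid_counts_b.get(cell, 0))
            for cell in set(grid_counts_a) | set(grid_counts_b)}

plan = join_plan({(0, 0): 900, (0, 1): 40}, {(0, 0): 35, (0, 1): 50})
print(plan)   # e.g. {(0, 0): 'data-oriented', (0, 1): 'space-oriented'}
```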

    Space odyssey: efficient exploration of scientific data.

    Advances in data acquisition, through more powerful supercomputers for simulation or sensors with better resolution, help scientists tremendously in understanding natural phenomena. At the same time, however, these advances leave them with a plethora of data and the challenge of analysing it. Ingesting all the data into a database or indexing it for efficient analysis is unlikely to pay off because scientists rarely need to analyse all of it. Not knowing a priori which parts of the datasets need to be analysed makes the problem challenging, and tools and methods for analysing only subsets of this data are rare. In this paper we therefore present Space Odyssey, a novel approach enabling scientists to efficiently explore multiple spatial datasets of massive size. Without any prior information, Space Odyssey incrementally indexes the datasets and optimizes the access to datasets frequently queried together. As our experiments show, by incrementally indexing and changing the data layout on disk, Space Odyssey accelerates exploratory analysis of spatial data, substantially reducing query-to-insight time compared to the state of the art.
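
    The co-access optimization can be sketched as simple bookkeeping over the query workload. The sketch below is illustrative, not Space Odyssey's actual policy: it counts dataset pairs queried together and flags a pair for co-location on disk once a hypothetical threshold is crossed, so a joint scan can read both in one sequential pass.

```python
# Illustrative co-access tracking for layout optimization.

from collections import Counter
from itertools import combinations

CO_ACCESS_THRESHOLD = 10   # hypothetical trigger for changing the layout

class LayoutAdvisor:
    def __init__(self):
        self.pair_counts = Counter()
        self.colocated = set()

    def observe_query(self, datasets):
        for pair in combinations(sorted(datasets), 2):
            self.pair_counts[pair] += 1
            if (self.pair_counts[pair] >= CO_ACCESS_THRESHOLD
                    and pair not in self.colocated):
                self.colocated.add(pair)   # a real system would rewrite the layout here

advisor = LayoutAdvisor()
for _ in range(10):
    advisor.observe_query(["trajectories", "meshes"])
print(advisor.colocated)   # {('meshes', 'trajectories')}
```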

    Data Infrastructure for Medical Research

    While we are witnessing rapid growth in data across the sciences and in many applications, this growth is particularly remarkable in the medical domain, be it because of higher-resolution instruments and diagnostic tools (e.g., MRI), new sources of structured data like activity trackers, the widespread use of electronic health records, and many others. The sheer volume of the data is not, however, the only challenge to be faced when using medical data for research. Other crucial challenges include data heterogeneity, data quality, and data privacy, among others. In this article, we review solutions addressing these challenges by discussing the current state of the art in data integration, data cleaning, data privacy, and scalable data access and processing in the context of medical data. The techniques and tools we present will give practitioners, computer scientists and medical researchers alike, a starting point to understand the challenges and solutions and, ultimately, to analyse medical data and gain better and quicker insights.

    Nanopore sequencing simulator for DNA data storage

    The exponential increase of digital data and the limited capacity of current storage devices have made clear the need to explore new storage solutions. Thanks to its biological properties, DNA has proven to be a promising candidate for this task, allowing information to be stored at high density for hundreds or even thousands of years. With the release of nanopore sequencing technologies, DNA data storage is one step closer to becoming a reality. Many works have proposed solutions for simulating this sequencing step, aiming to ease the development of algorithms that process nanopore-sequenced reads. However, these simulators target the sequencing of complete genomes, whose characteristics differ from those of synthetic DNA. This work presents a nanopore sequencing simulator targeting synthetic DNA in the context of DNA data storage.
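
    The difference from whole-genome simulators can be made concrete with a small sketch: every simulated read covers one short stored oligo end to end, with indel-heavy errors injected, rather than a fragment sampled from a long genome. The error rates below are illustrative placeholders, not fitted nanopore error profiles.

```python
# Illustrative nanopore-style read simulation over short synthetic oligos.

import random

SUB, INS, DEL = 0.03, 0.03, 0.04   # placeholder per-base error rates
BASES = "ACGT"

def simulate_read(oligo, rng=random):
    read = []
    for base in oligo:
        r = rng.random()
        if r < DEL:
            continue                                     # deletion: base skipped
        if r < DEL + INS:
            read.append(rng.choice(BASES))               # insertion before the base
        if rng.random() < SUB:
            base = rng.choice(BASES.replace(base, ""))   # substitution
        read.append(base)
    return "".join(read)

# Example: one noisy read of a stored 24-mer.
print(simulate_read("ACGTACGTACGTACGTACGTACGT"))
```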

    eTRIKS Analytical Environment: A Modular High Performance Framework for Medical Data Analysis

    Translational research is quickly becoming a science driven by big data. Improving patient care and developing personalized therapies and new drugs depend increasingly on an organization's ability to rapidly and intelligently leverage complex molecular and clinical data from a variety of large-scale partner and public sources. As analysing these large-scale datasets becomes increasingly expensive computationally, traditional analytical engines struggle to provide timely answers to the questions biomedical scientists are asking. Designing such a framework means developing for a moving target, as the very nature of biomedical research based on big data requires an environment capable of adapting quickly and efficiently in response to evolving questions. The resulting framework must consequently be scalable in the face of large amounts of data, flexible, efficient, and resilient to failure. In this paper we present the eTRIKS Analytical Environment (eAE), a scalable and modular framework for the efficient management and analysis of large-scale medical data, in particular the massive amounts of data produced by high-throughput technologies. We discuss in particular how we designed the eAE as a modular and efficient framework that lets us add new components or replace old ones easily, and we further elaborate on its use for a set of challenging big-data use cases in medicine and drug discovery.
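
    The modularity described here is commonly realized as a registry of components behind a shared interface, so a compute backend can be added or swapped without touching the rest of the framework. The sketch below is a generic illustration of that pattern, not the eAE codebase; all names in it are hypothetical.

```python
# Generic plugin-registry pattern for pluggable compute modules (illustrative).

from abc import ABC, abstractmethod

class ComputeModule(ABC):
    @abstractmethod
    def run(self, job): ...

REGISTRY = {}

def register(name):
    def wrap(cls):
        REGISTRY[name] = cls   # adding a module is just registering a class
        return cls
    return wrap

@register("wordcount")
class WordCount(ComputeModule):
    def run(self, job):
        return len(str(job).split())

def submit(module_name, job):
    return REGISTRY[module_name]().run(job)   # dispatch to the chosen module

print(submit("wordcount", "count the tokens in this job payload"))
```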